Visualizing Player Travel in the ATP Tour

The ATP and WTA Tours are, at their core, world tours. And yet, for all the tennis visualizations out there, I've seldom seen a visualization of players moving en masse across the planet. Here I display two different ways to visualize the movement of players between tournaments: a Sankey diagram, and a good ol' fashioned map of the world. These visualizations are for the 2017 ATP World Tour.

While neither visualization is perfect, they play complementary roles. The Sankey diagram does an excellent job of showing the temporal sequence of tournaments and the number of players moving between them. The one thing it doesn't really capture is geography; much of the overall tournament schedule, and of individual players' calendars, is based on distance and ease of travel between tournaments. The world map is a less interesting stand-alone figure, but it is a useful complement to the Sankey diagram.

Part 1: Data Wrangling

I use two data sources to create these visualizations:

  1. The first is match result data from Jeff Sackmann, creator of Tennis Abstract, the Match Charting Project, and the blog Heavy Topspin. This data is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. I use it to determine the tournaments played by each player. Once I've created each individual's tournament schedule, I can aggregate them to create groups of players traveling between tournaments.

  2. The second is a free database of cities and their latitudes and longitudes from SimpleMaps.com. I use the latitude and longitude information to plot the cities and travel paths of players on a world map.

I've shown below how I create the input data, but if you're not interested in the intricacies of data wrangling, feel free to skip straight to Part 2.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# modules for Sankey diagram
import plotly as py
py.offline.init_notebook_mode(connected=True)
import plotly.io as pio
from IPython.display import Image

#modules for World Map
import cartopy
import cartopy.crs as ccrs
In [2]:
path = '/Users/admin/Documents/Personal_Data_Project/2_Tennis/0_Raw_Input_Data/3_match_results/tennis_atp_master'
tour_matches = pd.read_csv(path + '/atp_matches_2017.csv')

print('There are '
      + str(len(tour_matches))
      + ' matches present, with a total of '
      + str(len(tour_matches.columns))
      + ' columns. The earliest date is '
      + str(tour_matches['tourney_date'].min())
      + ' and the last date is '
      + str(tour_matches['tourney_date'].max())
      + '.'
     )
tour_matches.columns
There are 2886 matches present, with a total of 49 columns. The earliest date is 20170102 and the last date is 20171124.
Out[2]:
Index(['tourney_id', 'tourney_name', 'surface', 'draw_size', 'tourney_level',
       'tourney_date', 'match_num', 'winner_id', 'winner_seed', 'winner_entry',
       'winner_name', 'winner_hand', 'winner_ht', 'winner_ioc', 'winner_age',
       'winner_rank', 'winner_rank_points', 'loser_id', 'loser_seed',
       'loser_entry', 'loser_name', 'loser_hand', 'loser_ht', 'loser_ioc',
       'loser_age', 'loser_rank', 'loser_rank_points', 'score', 'best_of',
       'round', 'minutes', 'w_ace', 'w_df', 'w_svpt', 'w_1stIn', 'w_1stWon',
       'w_2ndWon', 'w_SvGms', 'w_bpSaved', 'w_bpFaced', 'l_ace', 'l_df',
       'l_svpt', 'l_1stIn', 'l_1stWon', 'l_2ndWon', 'l_SvGms', 'l_bpSaved',
       'l_bpFaced'],
      dtype='object')

I stick to just 10 columns from the tour_matches dataframe, which can be classified into three groups.

In [3]:
tourney_columns = ['tourney_name', 'surface', 'tourney_level', 'tourney_date']
winner_columns = ['winner_id', 'winner_name', 'winner_rank']
loser_columns = ['loser_id', 'loser_name', 'loser_rank']
tour_matches.loc[:, tourney_columns + winner_columns + loser_columns].head()
Out[3]:
tourney_name surface tourney_level tourney_date winner_id winner_name winner_rank loser_id loser_name loser_rank
0 Brisbane Hard A 20170102 105777 Grigor Dimitrov 17.0 105453 Kei Nishikori 5.0
1 Brisbane Hard A 20170102 105777 Grigor Dimitrov 17.0 105683 Milos Raonic 3.0
2 Brisbane Hard A 20170102 105453 Kei Nishikori 5.0 104527 Stanislas Wawrinka 4.0
3 Brisbane Hard A 20170102 105683 Milos Raonic 3.0 104745 Rafael Nadal 9.0
4 Brisbane Hard A 20170102 105777 Grigor Dimitrov 17.0 106233 Dominic Thiem 8.0

Below I show the dataframe containing city locations. Many cities share the same name, so I keep only the most populated city for each name. Fortunately, the major tennis tournaments are hosted in the most populated city for each city name.

In [4]:
path = '/Users/admin/Documents/Personal_Data_Project/2_Tennis/0_Raw_Input_Data/4_geography'
cities = pd.read_csv(path + '/worldcities.csv')
cities = (cities
          .sort_values(by=['population'])
          .drop_duplicates(subset='city_ascii', keep='last')
         )
cities.head()
Out[4]:
city city_ascii lat lng country iso2 iso3 admin_name capital population id
3554 Chernobyl Chernobyl 51.3894 30.0989 Ukraine UA UKR Kyyivs’ka Oblast’ NaN 0.0 1804043438
2604 Logashkino Logashkino 70.8504 153.9000 Russia RU RUS Sakha (Yakutiya) NaN 0.0 1643050775
2570 Ambarchik Ambarchik 69.6510 162.3336 Russia RU RUS Sakha (Yakutiya) NaN 0.0 1643739159
5332 Ennadai Ennadai 61.1333 -100.8833 Canada CA CAN Nunavut NaN 0.0 1124019423
2512 Nordvik Nordvik 74.0165 111.5100 Russia RU RUS Krasnoyarskiy Kray NaN 0.0 1643587468

Constructing Player Schedules

As we saw with the tour matches dataframe, tournament attendance information for each player is split into the winner and loser columns. Here I split the dataframe into the winner and loser sides and then concatenate the two. That way every match a player took part in appears on its own row, regardless of the outcome, and I don't miss a tournament where someone lost in the first round.

In [5]:
player_columns = ['player_id', 'player_name', 'player_rank']

player_schedules = pd.concat([
                        (tour_matches.loc[:, tourney_columns + winner_columns]
                             .rename(columns=dict(zip(winner_columns, player_columns)))),
                        (tour_matches.loc[:, tourney_columns + loser_columns]
                             .rename(columns=dict(zip(loser_columns, player_columns))))
                        ])
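
To make the reshaping easier to see, here is a minimal sketch of the same idea on a made-up two-match table. The `matches` and `long` names and the rows are hypothetical stand-ins, not the real data.

```python
import pandas as pd

# Hypothetical two-match table in the same wide winner/loser layout.
matches = pd.DataFrame({
    'tourney_name': ['Brisbane', 'Brisbane'],
    'winner_name': ['Dimitrov', 'Nishikori'],
    'loser_name': ['Nishikori', 'Wawrinka'],
})

# Rename each side to a common column name, then stack the two halves.
long = pd.concat([
    matches[['tourney_name', 'winner_name']]
        .rename(columns={'winner_name': 'player_name'}),
    matches[['tourney_name', 'loser_name']]
        .rename(columns={'loser_name': 'player_name'}),
], ignore_index=True)

print(long)
```

Each match now contributes two rows, one per player, so a player who lost his only match still shows up at that tournament.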

Next, I need to prepare the player_schedules dataframe for merging with city location data. There are a few broad categories of tasks here:

  1. I eliminate Davis Cup matches from the dataset. This is mostly for convenience and to keep the number of events at a reasonable number for display.
  2. The spelling of some tournaments doesn't match the spelling of the city in the location data, or it doesn't look good for display, so I correct those names.
  3. Some tournament names are either not locations or not a city name in the location data. I have identified the nearest city in the location data for these tournaments on an ad hoc basis. I'm not an expert on geography, so there are likely more precise substitutes available.
In [6]:
#drop davis cup from schedules
player_schedules = player_schedules.loc[
                    ~player_schedules.tourney_name.str.contains('Davis Cup'), :]

#cleaning event/city names
tourney_name_cleaning_dict = {"'S-Hertogenbosch" : "'s-Hertogenbosch",
                              'Marrakech' : 'Marrakesh',
                              'Rio De Janeiro' : 'Rio de Janeiro',
                              'Us Open' : 'US Open',
                              'Canada Masters' : 'Montreal',
                              'Beijing ' : 'Beijing'
                             }
player_schedules.tourney_name = player_schedules.tourney_name.replace(
                                    tourney_name_cleaning_dict)

player_schedules['location'] = player_schedules.tourney_name
player_schedules.loc[(player_schedules.tourney_name == 'London')
                     & (player_schedules.surface == 'Grass'),
                     'tourney_name'] = "Queen's Club"
player_schedules['location'] = player_schedules['location'].str.replace(' Masters', '')
                     
location_cleaning_dict = {'Australian Open' : 'Melbourne',
                          'Roland Garros' : 'Paris',
                          'Wimbledon' : 'London',
                          'US Open' : 'Queens',
                          'Antwerp' : 'Antwerpen',
                          'Monte Carlo' : 'Monaco',
                          'Estoril' : 'Lisbon',
                          'Halle' : 'Bielefeld',
                          'Eastbourne' : 'Brighton',
                          'Bastad' : 'Halmstad',
                          "Queen's Club" : 'London',
                          'Umag' : 'Trieste',
                          'Gstaad' : 'Sion',
                          'Kitzbuhel' : 'Innsbruck',
                          'Los Cabos' : 'Cabo San Lucas'
                         }
player_schedules.location = player_schedules.location.replace(location_cleaning_dict)

player_schedules.loc[player_schedules.tourney_name != player_schedules.location].head()
Out[6]:
tourney_name surface tourney_level tourney_date player_id player_name player_rank location
139 Australian Open Hard G 20170116 104918 Andy Murray 1.0 Melbourne
140 Australian Open Hard G 20170116 126094 Andrey Rublev 152.0 Melbourne
141 Australian Open Hard G 20170116 200282 Alex De Minaur 301.0 Melbourne
142 Australian Open Hard G 20170116 105023 Sam Querrey 32.0 Melbourne
143 Australian Open Hard G 20170116 104545 John Isner 19.0 Melbourne

Now I merge the player schedules with city locations. I never actually use the country variable, but it is helpful to keep around for sense-checking the data and ensuring I selected the proper city.

In [7]:
player_schedules = player_schedules.merge(
                        cities.loc[:, ['city_ascii', 'lat', 'lng', 'country',]],
                        left_on='location',
                        right_on='city_ascii',
                        how='left').drop(columns=['city_ascii'])
player_schedules.head()
Out[7]:
tourney_name surface tourney_level tourney_date player_id player_name player_rank location lat lng country
0 Brisbane Hard A 20170102 105777 Grigor Dimitrov 17.0 Brisbane -27.455 153.0351 Australia
1 Brisbane Hard A 20170102 105777 Grigor Dimitrov 17.0 Brisbane -27.455 153.0351 Australia
2 Brisbane Hard A 20170102 105453 Kei Nishikori 5.0 Brisbane -27.455 153.0351 Australia
3 Brisbane Hard A 20170102 105683 Milos Raonic 3.0 Brisbane -27.455 153.0351 Australia
4 Brisbane Hard A 20170102 105777 Grigor Dimitrov 17.0 Brisbane -27.455 153.0351 Australia
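
Because this is a left merge, a tournament whose location has no match in the city data is kept but ends up with missing coordinates. Here is a minimal sketch of one way to spot such rows, using small made-up frames. The `schedule` and `city_table` names are hypothetical, and the Bielefeld coordinates are rough illustrative values rather than figures from the SimpleMaps file.

```python
import pandas as pd

# Hypothetical stand-ins for player_schedules and cities.
schedule = pd.DataFrame({
    'tourney_name': ['Brisbane', 'Halle', 'Gstaad'],
    # 'Gstaad' is deliberately left unmapped to show a failed match.
    'location': ['Brisbane', 'Bielefeld', 'Gstaad'],
})
city_table = pd.DataFrame({
    'city_ascii': ['Brisbane', 'Bielefeld'],
    'lat': [-27.455, 52.03],    # Bielefeld values are illustrative only
    'lng': [153.0351, 8.53],
})

# indicator=True adds a '_merge' column describing where each row came from.
merged = schedule.merge(city_table, left_on='location',
                        right_on='city_ascii', how='left',
                        indicator=True)

# Rows flagged 'left_only' found no city, so lat and lng are NaN there.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'location'].unique()
print(unmatched)
```

An empty result would suggest every location found a city; anything listed is a candidate for another entry in the location cleaning dictionary.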

I'm going to construct a dataframe of transitions between events. Each row will represent one or more players playing two tournaments consecutively. The Sankey diagram in particular will look better with a node for the start and end of each player's season. Below I insert a row for the start and end of the season for each player.

In [8]:
player_starts = player_schedules.loc[:, player_columns[0:-1]].drop_duplicates()
player_starts['tourney_name'] = 'Start of Season'
player_starts['tourney_date'] = 20170101

player_ends = player_schedules.loc[:, player_columns[0:-1]].drop_duplicates()
player_ends['tourney_name'] = 'End of Season'
player_ends['tourney_date'] = 20171231

player_schedules = pd.concat([player_starts, player_ends, player_schedules], sort=False)
player_schedules = (player_schedules
                    .sort_values(by=['player_id', 'tourney_date'])
                    .drop_duplicates()
                    .reset_index(drop=True))
player_schedules.head()
Out[8]:
player_id player_name tourney_name tourney_date surface tourney_level player_rank location lat lng country
0 100644 Alexander Zverev Start of Season 20170101 NaN NaN NaN NaN NaN NaN NaN
1 100644 Alexander Zverev Australian Open 20170116 Hard G 24.0 Melbourne -37.8200 144.975 Australia
2 100644 Alexander Zverev Montpellier 20170206 Hard A 21.0 Montpellier 43.6104 3.870 France
3 100644 Alexander Zverev Rotterdam 20170213 Hard A 18.0 Rotterdam 51.9200 4.480 Netherlands
4 100644 Alexander Zverev Marseille 20170220 Hard A 18.0 Marseille 43.2900 5.375 France

Lastly, I use the shift function to create a group of columns for the next tournament each player attends. Now each row represents a pair of tournaments attended by the player, and each tournament appears twice (once as the "current" tournament and once as the next tournament).

In [9]:
player_nexts = (player_schedules
                .groupby(['player_id', 'player_name'])
                [['tourney_name', 'location', 'lat', 'lng', 'country']].shift(-1)
               )
player_nexts.columns = 'next_' + player_nexts.columns
player_schedules = player_schedules.join(player_nexts)
player_schedules.head()
Out[9]:
player_id player_name tourney_name tourney_date surface tourney_level player_rank location lat lng country next_tourney_name next_location next_lat next_lng next_country
0 100644 Alexander Zverev Start of Season 20170101 NaN NaN NaN NaN NaN NaN NaN Australian Open Melbourne -37.8200 144.9750 Australia
1 100644 Alexander Zverev Australian Open 20170116 Hard G 24.0 Melbourne -37.8200 144.975 Australia Montpellier Montpellier 43.6104 3.8700 France
2 100644 Alexander Zverev Montpellier 20170206 Hard A 21.0 Montpellier 43.6104 3.870 France Rotterdam Rotterdam 51.9200 4.4800 Netherlands
3 100644 Alexander Zverev Rotterdam 20170213 Hard A 18.0 Rotterdam 51.9200 4.480 Netherlands Marseille Marseille 43.2900 5.3750 France
4 100644 Alexander Zverev Marseille 20170220 Hard A 18.0 Marseille 43.2900 5.375 France Indian Wells Masters Indian Wells 33.7036 -116.3396 United States
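
The grouped shift is the key step here, so a small sketch may help. It uses a hypothetical two-player frame (`toy` is not part of the real data) that is already sorted by player and date.

```python
import pandas as pd

# Hypothetical two-player schedule, already sorted by player and date.
toy = pd.DataFrame({
    'player_id': [1, 1, 1, 2, 2],
    'tourney_name': ['Start of Season', 'Brisbane', 'End of Season',
                     'Start of Season', 'End of Season'],
})

# shift(-1) within each player pulls the following row's value up by one,
# so a player's last row gets NaN rather than the next player's first event.
toy['next_tourney_name'] = toy.groupby('player_id')['tourney_name'].shift(-1)
print(toy)
```

Shifting without the groupby would instead pair player 1's End of Season row with player 2's Start of Season row, which is why the grouping matters.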

Constructing Transitions Between Tournaments

Below I define three functions to create final inputs for my data visualizations. The Sankey diagram will require two dataframes: a dataframe of tournaments for the nodes and a dataframe of tournament transitions for the links. The transitions dataframe is different from the player schedules dataframe because it has just one row for each pair of tournaments (multiple players traveling from Melbourne to Montpellier are represented in a single row).

Tournaments and transitions between them will be color coded by surface and event level (Slams, Masters 1000s, and 250s/500s), and those colors are assigned here as well.

In [10]:
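def assign_tournament_color(row):
    # applied row by row: row['color'] holds the surface's colormap, and
    # calling it with the level index returns an RGBA tuple
    row['color'] = row['color'](row['color_index'])
    return row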
In [11]:
def construct_tournament_list(df):

    _df_out = df.loc[:, tourney_columns]
    
    _df_out = (_df_out
               .drop_duplicates()
               .sort_values(by=['tourney_date', 'tourney_name'])
               .reset_index(drop=True)
              )
    
    cmap_dict = {'Hard' : plt.cm.get_cmap('Blues'),
                 'Clay' : plt.cm.get_cmap('Reds'),
                 'Grass': plt.cm.get_cmap('Greens'),
                 np.nan : plt.cm.get_cmap('Greys')
                }
    
    level_dict = {'G' : 0.9,
                  'M' : 0.6,
                  'A' : 0.3,
                  np.nan : 0.5
                 }

    _df_out['color'] = _df_out.surface.map(cmap_dict)
    _df_out['color_index'] = _df_out.tourney_level.map(level_dict)
    _df_out = _df_out.apply(assign_tournament_color, axis=1)
    _df_out['color'] = _df_out.color.str[0:3] + (tournament_alpha,)

    return _df_out.drop(columns=['color_index'])
In [12]:
def construct_transitions(df):
    df_tourneys = construct_tournament_list(df)
    
    id_dict = dict(zip(df_tourneys.tourney_name, df_tourneys.index))
    color_dict = dict(zip(df_tourneys.tourney_name, df_tourneys.color))
    df_trans = (df
                .fillna('Unknown')
                .groupby(['tourney_date', 'tourney_name',
                          'location', 'lat', 'lng', 'country',
                          'next_tourney_name', 'next_location',
                          'next_lat', 'next_lng', 'next_country'
                         ])
                .size()
                .reset_index()
                .replace(to_replace={'Unknown' : np.nan})
                .rename(columns={0 : 'num_players'})
               )
    df_trans['origin_color'] = (df_trans.tourney_name.map(color_dict).str[0:3]
                                + (transition_alpha,))
    df_trans_ids = df_trans.replace(id_dict)
    return df_tourneys, df_trans, df_trans_ids
In [13]:
tournament_alpha = 1
transition_alpha = 0.5

peak_ranks = player_schedules.groupby('player_name')['player_rank'].min()

player_list = peak_ranks[peak_ranks <= 20].index
selection = player_schedules.loc[player_schedules.player_name.isin(player_list), :]
tournaments, transitions, transition_ids = construct_transitions(selection)

tournaments.head()
Out[13]:
tourney_name surface tourney_level tourney_date color
0 Start of Season NaN NaN 20170101 (0.586082276047674, 0.586082276047674, 0.58608...
1 Brisbane Hard A 20170102 (0.7161860822760477, 0.8332026143790849, 0.916...
2 Chennai Hard A 20170102 (0.7161860822760477, 0.8332026143790849, 0.916...
3 Doha Hard A 20170102 (0.7161860822760477, 0.8332026143790849, 0.916...
4 Auckland Hard A 20170109 (0.7161860822760477, 0.8332026143790849, 0.916...
In [14]:
transitions.head()
Out[14]:
tourney_date tourney_name location lat lng country next_tourney_name next_location next_lat next_lng next_country num_players origin_color
0 20170101 Start of Season NaN NaN NaN NaN Auckland Auckland -36.8481 174.7630 New Zealand 2 (0.586082276047674, 0.586082276047674, 0.58608...
1 20170101 Start of Season NaN NaN NaN NaN Australian Open Melbourne -37.8200 144.9750 Australia 5 (0.586082276047674, 0.586082276047674, 0.58608...
2 20170101 Start of Season NaN NaN NaN NaN Brisbane Brisbane -27.4550 153.0351 Australia 8 (0.586082276047674, 0.586082276047674, 0.58608...
3 20170101 Start of Season NaN NaN NaN NaN Chennai Chennai 13.0900 80.2800 India 3 (0.586082276047674, 0.586082276047674, 0.58608...
4 20170101 Start of Season NaN NaN NaN NaN Delray Beach Delray Beach 26.4550 -80.0905 United States 1 (0.586082276047674, 0.586082276047674, 0.58608...
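
One detail in construct_transitions is worth illustrating: the missing values are filled with a placeholder before grouping, because groupby drops rows whose key is NaN by default, and the Start of Season rows have no location. Here is a minimal sketch on a hypothetical frame (`toy` and its rows are made up).

```python
import numpy as np
import pandas as pd

# Hypothetical per-player rows: current location and next tournament.
toy = pd.DataFrame({
    'location': [np.nan, np.nan, 'Melbourne', 'Melbourne', 'Melbourne'],
    'next_tourney_name': ['Brisbane', 'Brisbane', 'Montpellier',
                          'Montpellier', 'Rotterdam'],
})

# Default groupby silently drops the rows whose key is NaN.
default_counts = toy.groupby(['location', 'next_tourney_name']).size()

# Filling the NaN first (as construct_transitions does) keeps them.
filled_counts = (toy.fillna('Unknown')
                    .groupby(['location', 'next_tourney_name'])
                    .size()
                    .reset_index(name='num_players'))
print(filled_counts)
```

On pandas 1.1 or later, passing `dropna=False` to `groupby` should give the same counts without the placeholder, though I have not swapped that into the function above.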

Part 2: Visualizations

Sankey Diagrams

The first visualization I will show is a Sankey diagram. I've written a basic function that creates a Sankey diagram from my tournaments and transitions dataframes.

In [15]:
def create_tournaments_sankey(
    tournaments, transition_ids,
    pad=500, title='ATP World Tour'):
    
    data = dict(
        type='sankey',
        orientation= 'v',
        node = dict(
            pad = pad,
            thickness = 33,
            line = dict(
                color = "black",
                width = 0.5
            ),
          label = tournaments.tourney_name,
          color = 'rgba' + tournaments.color.astype(str)
        ),
        link = dict(
            source = transition_ids.tourney_name,
            target = transition_ids.next_tourney_name,
            value = transition_ids.num_players,
            color = 'rgba' + transition_ids.origin_color.astype(str)
      ))

    layout =  dict(
        title = title,
        height = 3500,
        width = 2000,
        font = dict(
          size = 20
        )
    )

    fig = dict(data=[data], layout=layout)

    img_bytes = pio.to_image(fig, format='png')

    return Image(img_bytes)

I've chosen to start by looking at player travel for all players who attained a peak rank of 20 or better in 2017. This is a population of more than 20 players because players can enter and leave the top 20 every week when the rankings are updated.
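
As a quick illustration of why the group is larger than 20, here is a minimal sketch of the peak-rank filter on made-up rankings. The players `A`, `B`, and `C` and their ranks are hypothetical.

```python
import pandas as pd

# Hypothetical rankings recorded at two events for three players.
toy = pd.DataFrame({
    'player_name': ['A', 'A', 'B', 'B', 'C', 'C'],
    'player_rank': [22, 19, 18, 25, 40, 35],
})

# A lower number is a better rank, so the peak rank is the minimum.
peak = toy.groupby('player_name')['player_rank'].min()
top20 = peak[peak <= 20].index.tolist()
print(top20)
```

Players A and B both qualify even though each was outside the top 20 at one of their two events, which is how the selection grows beyond 20 names over a full season.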

Hard court events are shown in blue, clay court events are shown in red, and grass court events are shown in green. Slams are darkest, Masters 1000 events have intermediate darkness, and other events (250s and 500s) are lightest. There are only two green colors because there isn't a Masters event played on grass.
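
The shading comes from sampling a sequential matplotlib colormap at a different position for each event level. Here is a minimal sketch of that idea. The `tournament_color` helper is hypothetical and only mirrors the 0.9 and 0.3 indices used in construct_tournament_list; it is not the code that produced the figures.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt


def tournament_color(surface_cmap, level_index, alpha):
    """Sample a colormap and swap in a custom alpha (hypothetical helper)."""
    # Calling a colormap with a float in 0..1 returns an (r, g, b, a) tuple.
    rgba = plt.get_cmap(surface_cmap)(level_index)
    return tuple(rgba[0:3]) + (alpha,)


slam = tournament_color('Greens', 0.9, 1)    # Wimbledon-style shade
other = tournament_color('Greens', 0.3, 1)   # 250/500-style shade
print(slam, other)
```

In a sequential colormap such as Greens, a higher index gives a darker color, which is why the Slams sit at 0.9 and the smaller events at 0.3.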

In [16]:
create_tournaments_sankey(tournaments,
                          transition_ids,
                          pad=200,
                          title='2017 ATP Player Travel, Peak Rank in Top 20')
Out[16]:

It's easiest to get used to these kinds of figures by focusing on a single event. Take Indian Wells as an example. The vast majority of top 20 players who competed at Indian Wells came from Acapulco and Dubai. There are also a few players who came from Delray Beach. These are all hard court events. However, there are also a few players who went to South America after the Australian Open to compete on clay. Looking at the South American swing, most players who went played multiple tournaments while there. Few players sought out the South American clay followed by a hard court warm-up event before Indian Wells.

Because I'm looking at a high level population, almost all of the players that played at Indian Wells went immediately to Miami. (These Masters tournaments are so close on the calendar that a player that wins both in the same season is often said to have completed the Sunshine Double).

From there, most players begin the clay court season, though one player of note skipped the 2017 clay court season and proceeded directly to the grass of Stuttgart. That player's name is Roger Federer.

In [17]:
player_schedules.loc[(
    player_schedules.tourney_name == 'Miami Masters')
    & (player_schedules.next_tourney_name == 'Stuttgart')
    & player_schedules.player_name.isin(player_list), :]
Out[17]:
player_id player_name tourney_name tourney_date surface tourney_level player_rank location lat lng country next_tourney_name next_location next_lat next_lng next_country
110 103819 Roger Federer Miami Masters 20170320 Hard M 6.0 Miami 25.784 -80.2102 United States Stuttgart Stuttgart 48.78 9.2 Germany

It's also interesting to look at how players arrive at the grey End of Season node. Some top players ended their seasons at Wimbledon that year, though most finished at the Paris Masters or the ATP World Tour Finals (London).

In [18]:
player_schedules.loc[(
    player_schedules.tourney_name == 'Wimbledon')
    & (player_schedules.next_tourney_name == 'End of Season')
    & player_schedules.player_name.isin(player_list), :]
Out[18]:
player_id player_name tourney_name tourney_date surface tourney_level player_rank location lat lng country next_tourney_name next_location next_lat next_lng next_country
652 104527 Stanislas Wawrinka Wimbledon 20170703 Grass G 3.0 London 51.5 -0.1167 United Kingdom End of Season NaN NaN NaN NaN
1100 104918 Andy Murray Wimbledon 20170703 Grass G 1.0 London 51.5 -0.1167 United Kingdom End of Season NaN NaN NaN NaN
1123 104925 Novak Djokovic Wimbledon 20170703 Grass G 4.0 London 51.5 -0.1167 United Kingdom End of Season NaN NaN NaN NaN

While the visualization for players in the top 20 was relatively easy to digest, I'm more interested in a broader look at the sport. I've shown below the same figure, but this time for players that were in the top 100 for at least one week in 2017. I haven't included any events below ATP 250s, so players who appear to skip large parts of the season or end their season early may actually have been playing lower level events.

In [19]:
player_list = peak_ranks[peak_ranks <= 100].index
selection = player_schedules.loc[player_schedules.player_name.isin(player_list), :]
tournaments, transitions, transition_ids = construct_transitions(selection)
create_tournaments_sankey(tournaments,
                          transition_ids,
                          title='2017 ATP Player Travel, Peak Rank in Top 100')
Out[19]:

The US Open Series is often advertised as The Road to the US Open, but for some players that road is paved with grass or clay.

World Maps

One of the most interesting features of the above diagrams is the South American Clay Court Season: Quito, Buenos Aires, Rio de Janeiro, Sao Paulo. Some players, presumably those with better results on clay, head to South America to play on clay, then come to North America for two Masters events on hard courts, and then return to clay in Europe (or Houston). Some skip Indian Wells and Miami entirely (those are likely the lower ranked players).

Geography clearly plays a role in the attendance at these tournaments. If you're already seeking out clay before the (first) hardcourt season ends, and there are back to back tournaments in the area, why not attend all of them?

To get a look at the geography of the ATP world tour, I've plotted the same data on a world map. Colors have the same meaning as before, and lines are thicker when more players take the same path.
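
The line widths use a logarithmic scale, so a path shared by many players is thicker without swamping the map. Here is a minimal sketch of the arithmetic, using hypothetical player counts.

```python
import numpy as np

# Hypothetical player counts on four different paths.
num_players = np.array([1, 2, 8, 32])

# Same scaling as the plotting function: log2 of the count, divided by 5.
linewidth = np.log2(num_players) / 5
print(linewidth)
```

Doubling the number of players adds a constant 0.2 to the width. A path taken by a single player gets a width of exactly zero under this formula, and how that is drawn may depend on the matplotlib backend; adding a small constant to the width is one possible adjustment if those paths need to stay visible.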

World map plotting and boundaries are made possible with the help of the Met Office's Python module cartopy.

In [20]:
def plot_tour_travel(
    transitions, extent=False,
    title='Player Travel in the 2017 ATP Tour'):
    
    fig = plt.figure(figsize=(20, 10))
    ax = fig.add_subplot(1, 1, 1, projection=ccrs.Robinson())
    greys_cmap = plt.cm.get_cmap('Greys')

    if extent:
        ax.set_extent(extent, crs=ccrs.Geodetic())
    else:
        ax.set_global()
    ax.add_feature(cartopy.feature.OCEAN, zorder=0, facecolor=greys_cmap(0.98))
    ax.add_feature(cartopy.feature.LAND, zorder=0, facecolor=greys_cmap(0.85))
    ax.coastlines()

    for i in range(len(transitions)):
        plt.plot([transitions.lng[i], transitions.next_lng[i]],
                 [transitions.lat[i], transitions.next_lat[i]],
                 linewidth = np.log2(transitions.num_players[i]) / 5,
                 color=transitions.origin_color[i],
                 transform=ccrs.Geodetic())

        plt.scatter(transitions.lng[i], transitions.lat[i],
                    color=transitions.origin_color[i], zorder=1000,
                    transform=ccrs.Geodetic())
    plt.title(title, fontsize=20)
    plt.show()
In [21]:
plot_tour_travel(transitions)

I should note that while this looks like a flight map, players may not actually be taking these paths. Players likely return home, or to wherever they do the bulk of their training, during the longer gaps in the season. Players may also travel for Davis Cup events, which do not tend to follow the general geographic movements of the ATP Tour.

Below is the same image, this time zoomed in on Europe. There's so much back and forth in European tournaments that it's hard to follow the sequences of events with this figure, but it still gives an idea of where the events are. Keen observers will note that some events seem to be missing, but what is actually happening is multiple events held in the same city are stacked on top of each other (Roland Garros and the Paris Masters in Paris; Queen's Club, Wimbledon, and the ATP World Tour Finals in London).

In [22]:
plot_tour_travel(transitions, extent = [-20, 45, 30, 60],
                title='Player Travel in the 2017 ATP Tour, Europe')

Part of what makes this so messy is that unlike in South America, high level play takes place in Europe in three different parts of the season: on hard courts after the Australian Open, on clay and grass after Miami, and on hard courts after the US Open and the Asian swing. To get a sense of the temporal back and forth, I've shown below the events played by Alexander Zverev. The color of each path between tournaments gets "hotter" as the season progresses.

In [23]:
def plot_player_travel(df, player):

    selection = df.loc[df.player_name == player, :].reset_index(drop=True)
    greys_cmap = plt.cm.get_cmap('Greys')

    fig = plt.figure(figsize=(20, 10))
    ax = fig.add_subplot(1, 1, 1, projection=ccrs.Robinson())

    ax.set_global()
    ax.add_feature(cartopy.feature.OCEAN, zorder=0, facecolor=greys_cmap(0.98))
    ax.add_feature(cartopy.feature.LAND, zorder=0, facecolor=greys_cmap(0.85))
    ax.coastlines()

    # ordinarily looping through a dataframe is bad practice
    # vectorised operations should be used instead
    # however, plotting in different colors requires different calls to plt.plot
    for i in range(len(selection)):
        plt.plot([selection.lng[i], selection.next_lng[i]],
                 [selection.lat[i], selection.next_lat[i]],
                 linewidth = 2,
                 color=plt.cm.get_cmap('plasma')(i / len(selection)),
                 transform=ccrs.Geodetic())

        plt.scatter(selection.lng[i], selection.lat[i],
                    color=plt.cm.get_cmap('plasma')(i / len(selection)), zorder=1000,
                    transform=ccrs.Geodetic())
    plt.title('2017 Tournament Schedule for ' + player, fontsize=20)
    plt.show()
In [24]:
plot_player_travel(player_schedules, 'Alexander Zverev')

Zverev, a European himself, had a heavily European schedule: the Australian Open, hard courts in Europe, two hard court Masters in the United States, the clay and grass season in Europe, the US Open Series in the United States, three hard court events in Asia, and more hard courts in Europe.

This is actually pretty common. A sport that often seems dominated by Europeans unsurprisingly has numerous events in Europe. But not all players have that schedule. Clay court specialists, like Dominic Thiem, tend to head to South America to get in at least one tournament on clay. (He won Rio without dropping a set, earning 500 points for his ranking).

In [25]:
plot_player_travel(player_schedules, 'Dominic Thiem')

Americans like John Isner often play more of their events in North and Central America, and spend less time in Europe for the clay court season.

In [26]:
plot_player_travel(player_schedules, 'John Isner')

Without taking a close look, I mostly assumed that the top players all played pretty much the same schedule, and most of the variation came from lower level players. (I honestly didn't know much about the schedule outside of the Slams and some of the Masters 1000s). While that's partially true, I'd now say that most of the schedule variation comes from lower level events, not lower level players. The Slams and Masters events actually give more cohesion to the schedule than I initially assumed, and players fill in the rest of their schedules according to their preferences, whether that's geography, court surface, ranking points, appearance fees, or something else entirely.
